Obesity has become a global health crisis, contributing to rising rates of non-communicable diseases and placing significant strain on healthcare systems worldwide. In this study, we explore the factors influencing obesity in Mexico, Peru, and Colombia through an analysis of a mixed dataset consisting of 77% synthetically generated data and 23% real-world data from 498 participants. Using data cleaning, visualization, and advanced modeling techniques, we identify key lifestyle and behavioral contributors to obesity, including dietary habits, physical activity, and demographic factors. The study employs linear regression to predict BMI and logistic regression to classify obesity, providing valuable insights into the relationship between these factors and obesity levels. While the findings are limited by the use of synthetic data and a non-representative sample, they underscore the importance of data-driven approaches in addressing public health challenges. This report aims to demonstrate the application of theoretical knowledge in a simulated environment and lay the groundwork for future studies targeting obesity reduction strategies.
Obesity has emerged as one of the most pressing global health crises, with its prevalence nearly tripling worldwide since 1975, according to the World Health Organization (WHO). This alarming trend has fueled a dramatic rise in obesity-related diseases, including diabetes, cardiovascular conditions, and hypertension, imposing significant burdens on healthcare systems and economies. In Latin America and the Caribbean, the situation is particularly concerning: as of 2022, the Pan American Health Organization (PAHO) reported that nearly 25% of adults in the region are affected by obesity, emphasizing the urgent need for effective public health interventions. The crisis is especially acute in the countries central to this research. In 2018, Mexico recorded an adult obesity rate of 36.1%, while Peru and Colombia reported similarly worrisome rates of approximately 28% and 23%, respectively.
This widespread prevalence underscores the critical need for research focused on understanding and addressing the multifaceted factors contributing to obesity. In this context, the present study adopts an exploratory and primarily educational approach to examine the relationships between dietary habits, physical activity, and demographic variables, aiming to uncover their impact on obesity levels in Mexico, Peru, and Colombia. By leveraging a dataset consisting of 77% synthetically generated data (produced via the SMOTE algorithm) and 23% user-collected data from 498 participants, the research seeks to provide meaningful insights into this complex issue.
While the reliance on synthetic data and a non-representative sample limits direct real-world applicability, this study offers a unique opportunity to apply theoretical knowledge gained during the “Data Science in Business Analytics” course to a simulated scenario. By identifying patterns, correlations, and potential predictors of obesity, the research highlights the importance of data-driven approaches in addressing significant public health challenges. Ultimately, the findings aim to lay the groundwork for future studies and contribute to the development of informed public health strategies and healthcare policies, demonstrating the transformative potential of data analytics in managing and mitigating complex issues.
1.2 Research Questions
Question 1
What are the key lifestyle and behavioral factors that significantly contribute to obesity in Mexico, Peru, and Colombia?
Question 2
Can we predict whether a person will be obese based on some given combinations of factors?
Question 3
How can these insights be effectively leveraged to inform public health initiatives and combat the escalating health crisis?
2 Data
2.1 Sources
The dataset utilized in this project was obtained from the UCI Machine Learning Repository, a reputable and extensively used platform for data science and machine learning projects. Originally compiled by researchers at the Universidad de la Costa, Colombia, the dataset combines 77% synthetically generated data with 23% real-world data collected through a structured online survey. The synthetic data, created using the Synthetic Minority Over-sampling Technique (SMOTE) in Weka, addresses class imbalance, enhancing the dataset’s suitability for machine learning tasks. The real-world data, gathered from 498 participants over a 30-day period, captures detailed self-reported information on dietary habits, physical activity levels, and demographic characteristics. While synthetic data introduces uniformity and balance, it inherently lacks the complexity of real-world variability, and the user-collected data, though authentic, is susceptible to self-reporting biases and sampling limitations. These characteristics, along with the dataset’s diverse origins, make it an invaluable resource for simulating real-world challenges in healthcare analytics.
2.2 Description
The dataset consists of 2111 records and 17 attributes, offering a detailed examination of the factors contributing to obesity. The attributes represent a mix of categorical and continuous variables, providing insights into demographic, lifestyle, and behavioral factors. In greater detail, an interactive table was designed to provide a comprehensive summary of the dataset’s variables.
Code
library(here)library(knitr)# Main features of the datasetdataset_raw <-read.csv(here("data/raw/dataset_raw.csv"))head(dataset_raw) %>%kbl(format ="html", caption ="First 6 Rows of the Raw Dataset") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed"), full_width =TRUE) %>%row_spec(0, bold =TRUE, background ="#f0f0f0") %>%scroll_box(width ="100%", height ="400px")
First 6 Rows of the Raw Dataset
Gender
Age
Height
Weight
family_history_with_overweight
FAVC
FCVC
NCP
CAEC
SMOKE
CH2O
SCC
FAF
TUE
CALC
MTRANS
NObeyesdad
Female
21
1.62
64.0
yes
no
2
3
Sometimes
no
2
no
0
1
no
Public_Transportation
Normal_Weight
Female
21
1.52
56.0
yes
no
3
3
Sometimes
yes
3
yes
3
0
Sometimes
Public_Transportation
Normal_Weight
Male
23
1.80
77.0
yes
no
2
3
Sometimes
no
2
no
2
1
Frequently
Public_Transportation
Normal_Weight
Male
27
1.80
87.0
no
no
3
3
Sometimes
no
2
no
2
0
Frequently
Walking
Overweight_Level_I
Male
22
1.78
89.8
no
no
2
1
Sometimes
no
2
no
0
0
Sometimes
Public_Transportation
Overweight_Level_II
Male
29
1.62
53.0
no
yes
2
3
Sometimes
no
2
no
0
0
Sometimes
Automobile
Normal_Weight
Code
val_meaning <-c("Indicates the gender of the individual (Male/Female).", "Represents the age of participants in years.", "The height of individuals in meters.", "The weight of participants in kilograms.", "Indicates whether a family member has suffered from overweight (Yes/No).", "Indicates if participants frequently consume high-caloric foods (Yes/No).", "Scaled from 1 to 3, reflects how often vegetables are consumed (1 = Never, 3 = Always).", "Indicates the typical number of main meals consumed daily.", "Describes how often participants eat between meals (e.g., No, Sometimes, Frequently, Always).", "Indicates whether participants smoke (Yes/No).", "Scaled from 1 to 3, reflecting daily water intake (1 = Less than 1 liter, 3 = More than 2 liters).", "Whether participants monitor their calorie intake (Yes/No).", "Scaled from 0 to 4, indicating days of physical activity per week (0 = None, 4 = 4-5 days).", "Reflects daily time spent on technological devices, in hours.", "Indicates the frequency of alcohol consumption (e.g., I don't drink, Sometimes, Frequently, Always).", "Describes the primary mode of transportation (e.g., Walking, Public Transportation, Automobile).", "The target variable, classifying obesity levels into categories such as Normal Weight, Overweight (Levels I and II), and Obesity (Types I, II, III).")desc_table <- tibble::tibble(Name =colnames(dataset_raw), Type =sapply(dataset_raw, class), Meaning = val_meaning)desc_table %>%kbl(format ="html", caption ="Variable Descriptions") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed"), full_width =TRUE) %>%row_spec(0, bold =TRUE, background ="#f0f0f0") %>%column_spec(1, bold =TRUE, width ="200px") %>%column_spec(2, width ="150px") %>%column_spec(3, width ="500px") %>%scroll_box(width ="100%", height ="500px")
Variable Descriptions
Name
Type
Meaning
Gender
character
Indicates the gender of the individual (Male/Female).
Age
numeric
Represents the age of participants in years.
Height
numeric
The height of individuals in meters.
Weight
numeric
The weight of participants in kilograms.
family_history_with_overweight
character
Indicates whether a family member has suffered from overweight (Yes/No).
FAVC
character
Indicates if participants frequently consume high-caloric foods (Yes/No).
FCVC
numeric
Scaled from 1 to 3, reflects how often vegetables are consumed (1 = Never, 3 = Always).
NCP
numeric
Indicates the typical number of main meals consumed daily.
CAEC
character
Describes how often participants eat between meals (e.g., No, Sometimes, Frequently, Always).
SMOKE
character
Indicates whether participants smoke (Yes/No).
CH2O
numeric
Scaled from 1 to 3, reflecting daily water intake (1 = Less than 1 liter, 3 = More than 2 liters).
SCC
character
Whether participants monitor their calorie intake (Yes/No).
FAF
numeric
Scaled from 0 to 4, indicating days of physical activity per week (0 = None, 4 = 4-5 days).
TUE
numeric
Reflects daily time spent on technological devices, in hours.
CALC
character
Indicates the frequency of alcohol consumption (e.g., I don't drink, Sometimes, Frequently, Always).
MTRANS
character
Describes the primary mode of transportation (e.g., Walking, Public Transportation, Automobile).
NObeyesdad
character
The target variable, classifying obesity levels into categories such as Normal Weight, Overweight (Levels I and II), and Obesity (Types I, II, III).
The dataset underwent a thorough preprocessing phase, including normalization of continuous variables, encoding of categorical data, and removal of missing or atypical entries to ensure high-quality analysis. Class imbalance was addressed using the SMOTE (Synthetic Minority Oversampling Technique), generating synthetic data while carefully avoiding noise or artificial patterns. The final dataset comprises 77% synthetic data, which enhances balance and diversity, and 23% real-world data, adding authenticity. This combination allows for a comprehensive analysis of obesity-related factors, while recognizing potential biases, such as inaccuracies in self-reported information.
2.3 Wrangling
Essential libraries for data manipulation, visualization, and clustering are loaded to begin the wrangling process and support subsequent analysis. Each package is utilized for its specific functionality, facilitating efficient and streamlined analysis:
dplyr: for data manipulation (e.g., filtering, summarizing);
tidyr: for data tidying (e.g., reshaping);
ggplot2: for visualization;
corrplot: for correlation matrix visualization;
ggridges: for creating ridge plots;
cluster: for clustering algorithms;
reshape2: for data reshaping, especially during visualization.
Column names are renamed to enhance clarity and improve usability during the analysis. The updated names are designed to be shorter and more intuitive, ensuring ease of reference while retaining their original meaning and context. This adjustment simplifies code readability and helps streamline data manipulation tasks, particularly in complex analytical workflows.
The structure of the dataset is examined to identify the data types of each variable, providing critical insights for subsequent data preparation. Understanding the data types helps pinpoint columns requiring transformations, such as converting categorical variables to factors or standardizing numeric variables for analysis.
The transformations ensured the dataset was ready for analysis by restructuring categorical and ordinal variables to meet modeling requirements. Converting categorical variables into factors standardized their representation, reducing ambiguity and improving compatibility with statistical models. For ordinal variables, levels were explicitly ordered to preserve their logical progression and enhance interpretability, allowing for meaningful comparisons across categories.
The updated structure was reviewed to confirm the accuracy of these adjustments, providing confidence in the preprocessing steps. While further transformations like normalization were not applied, the focus on categorical and ordinal adjustments established a strong foundation for reliable and interpretable analysis. In particular, the levels of obesity categories, food consumption between meals, and frequency of alcohol use were arranged to reflect increasing severity or frequency, ensuring these variables captured their intended relationships and supported clear, accurate insights into the data.
Now, a numerical version of the dataset, called “dataset_num”, is created by transforming categorical variables into numerical values, ensuring compatibility with statistical analyses while maintaining logical relationships and interpretability. This numerical transformation is specifically essential for developing the correlation matrix, as it requires all variables to be in numeric format to analyze their relationships effectively.
The presence of potential missing values in the transformed dataset is checked and visualized to confirm data integrity and ensure no issues have been introduced during the conversion process.
nb_na<-colSums(is.na(dataset_num))nb_na %>%kbl(format ="html", caption ="Presence of Potential NA Values in the Dataset") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed"), full_width =FALSE, position ="left" ) %>%column_spec(1, width ="100px") %>%column_spec(2, width ="80px") %>%row_spec(0, bold =TRUE, background ="#f5f5f5") %>%scroll_box(width ="100%", height ="400px")
Presence of Potential NA Values in the Dataset
x
gender
0
age
0
height
0
weight
0
family_hist
0
caloric_food
0
vegetable_food
0
nb_meal_day
0
food_btw_meals
0
smoke
0
ch2o
0
calorie_check
0
physical_act
0
use_tech
0
freq_alcohol
0
m_trans
0
obesity_lev
0
The test results confirmed the absence of any NA values in the dataset, indicating that all variables were successfully converted to numeric format without compromising data integrity.
2.4 Spotting Mistakes and Missing Data
Check for missing values
To ensure data integrity, missing values in the dataset are examined by counting “NA” values in each column, providing a clear view of dataset completeness. The results are presented in a formatted table for easy interpretation, with styling applied for readability and a scrollable box to handle larger datasets. This process facilitates prompt handling of missing data through appropriate strategies.
The analysis confirms that all columns contain complete data, with no missing values identified. This completeness ensures a robust foundation for subsequent analysis, eliminating the need for immediate data cleaning related to missing entries.
Check for duplicates
The dataset is examined for duplicated rows to ensure data integrity and eliminate redundancy. Identifying and addressing duplicates is a crucial step in data preprocessing, as redundant entries can skew analysis results and lead to misleading conclusions. This process involves systematically scanning the dataset for identical rows and quantifying their occurrence.
The detection of 24 duplicated rows in the dataset highlights the need for further preprocessing to ensure data integrity, as these redundant entries could skew analysis if not properly handled.
Code
dataset <- dataset %>%distinct()nrow(dataset)
[1] 2087
Code
any(duplicated(dataset))
[1] FALSE
The dataset was refined by removing duplicate entries to ensure that only unique rows are retained. A verification step confirmed that no duplicates remain, ensuring the dataset’s integrity and reliability for further analysis.
2.5 Listing Anomalies and Outliers
A bar chart was created to visualize the distribution of obesity levels, providing a clear overview of class frequencies within the dataset. Particular attention is given to obesity levels, as this variable serves as the dependent variable in the predictive model to be developed later.
Code
g1 <-ggplot(dataset, aes(x = obesity_lev)) +geom_bar(fill ="skyblue", color ="black") +theme_minimal() +labs(title ="Class Distribution of Obesity Levels",x ="Obesity Level",y ="Count" ) +theme(axis.text.x =element_text(angle =45, hjust =1)) #Adjusted the text for clarityplotly_plot <-ggplotly(g1)plotly_plot
The chart highlights a balanced distribution across all obesity levels, demonstrating the effectiveness of SMOTE in addressing class imbalance. By equalizing the representation of each category, the dataset becomes more reliable for analysis, reducing biases and ensuring a fair evaluation of patterns within the data. On the other hand, the synthetic data introduced by SMOTE may not fully reflect real-world variability, potentially leading to artificial patterns that could affect the interpretability of results.
A density plot was generated to visualize the age distribution across different obesity levels, providing insights into patterns and trends within the data.
Code
g2 <-ggplot(dataset, aes(x = age, fill = obesity_lev)) +geom_density(alpha =0.5) +theme_minimal() +labs(title ="Age Distribution by Obesity Levels",x ="Age",y ="Density",fill ="Obesity Level")plotly_plot1 <-ggplotly(g2)plotly_plot1
This graph provides a detailed view of the age distribution across obesity levels and offers insight into the impact of the SMOTE algorithm in balancing the dataset. The distributions show distinct separation among obesity categories, with younger ages predominantly associated with lower obesity levels (e.g., Insufficient Weight and Normal Weight), while older ages are more prevalent in higher obesity categories (e.g., Obesity Type II and III).
Notably, sharp peaks in the density curves, such as the one around age 30 in “Obesity Type I,” could indicate potential artifacts introduced during the synthetic data generation process. While these patterns align with logical demographic trends, they highlight the need for further validation to ensure that such separations and peaks represent realistic population characteristics rather than biases from data augmentation. Overall, the dataset reflects clear and interpretable patterns, but these observations suggest the importance of cautious interpretation and robust validation in subsequent analyses.
Summary statistics were computed for key variables across obesity levels to identify potential anomalies or patterns, providing a clearer understanding of how age, height, and weight vary within each category.
The summary statistics reveal distinct differences across obesity levels. As expected, weight increases progressively with higher obesity categories, accompanied by slightly larger variations in standard deviation. Interestingly, height remains relatively constant across categories, suggesting it plays a limited role in distinguishing obesity levels. The age distribution shows a notable shift, with younger individuals dominating the lower obesity levels and a broader age range in higher levels, highlighting potential demographic patterns worth further exploration. These insights confirm the logical trends in the dataset, providing confidence in its structure while emphasizing the need for further analysis of these relationships.
Clustering was performed using k-means to explore the dataset’s structure and assess the coherence of the groups, with the silhouette score calculated to evaluate the quality and separation of the clusters.
The mean silhouette score of approximately 0.456 indicates moderate cohesion within clusters and reasonable separation between them. This suggests that the clusters, representing different obesity levels, are distinguishable but not excessively isolated. The result reflects a balance between natural class separability and the effects of data augmentation with SMOTE, which appears to have effectively balanced the dataset without introducing significant distortions. These findings provide confidence in the dataset’s suitability for clustering-based exploration while highlighting the importance of further validation to ensure the robustness of the observed patterns.
2.6 Correlation Analysis
To explore relationships among variables and their association with obesity levels, a correlation matrix was computed. The analysis focuses on identifying the strength and direction of correlations between “obesity_lev” (the dependent variable) and other predictors, such as physical activity, frequency of alcohol consumption, and dietary habits. By ordering variables based on their correlation with “obesity_lev”, the matrix highlights the most influential factors in determining obesity levels. A heatmap visualization was then created to provide an intuitive representation of these relationships, with a gradient color scale indicating the strength of positive and negative correlations. This approach facilitates the identification of key variables for further analysis and modeling.
Code
#Assuming dataset_num is already defined and contains the relevant columnscor_matrix <-cor(dataset_num %>%select("physical_act", "freq_alcohol", "obesity_lev", "age", "weight","height", "family_hist", "caloric_food", "vegetable_food", "food_btw_meals", "use_tech", "ch2o", "m_trans", "smoke","nb_meal_day", "calorie_check", "gender"),use ="complete.obs")#Extract the correlations with 'obesity_lev'cor_with_obesity_lev <- cor_matrix["obesity_lev",]#Order variables by their correlation with 'obesity_lev'ordered_vars <-names(sort(cor_with_obesity_lev, decreasing =TRUE))#Reorder the correlation matrix based on this ordercor_matrix_ordered <- cor_matrix[ordered_vars, ordered_vars]#Melt the ordered correlation matrix into long formatcor_long <-melt(cor_matrix_ordered)g3 <-ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +geom_tile() +geom_text(aes(label =round(value, 2)), color ="black", size =2.5, vjust =0.5, hjust =0.5) +scale_fill_gradient2(low ="blue", mid ="white", high ="red", midpoint =0) +labs(title ="Correlation Heatmap Ordered by Obesity Level", x ="Variables", y="Variables") +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1), axis.text.y =element_text(angle =45, vjust =1))plot3 <-ggplotly(g3)plot3
Code
# Create the heatmap with correlation values# Assuming dataset_num is already defined and contains the relevant columnscor_matrix <-cor(dataset_num %>%select("physical_act", "freq_alcohol", "obesity_lev", "age", "weight", "family_hist", "caloric_food", "vegetable_food", "food_btw_meals", "use_tech","ch2o", "height", "calorie_check", "gender"), use ="complete.obs")# Extract the correlations with "obesity_lev"cor_with_obesity_lev <- cor_matrix["obesity_lev",]# Order variables by their correlation with 'obesity_lev'ordered_vars <-names(sort(cor_with_obesity_lev, decreasing =TRUE))# Reorder the correlation matrix based on this ordercor_matrix_ordered <- cor_matrix[ordered_vars, ordered_vars]# Melt the ordered correlation matrix into long formatcor_long <-melt(cor_matrix_ordered)g4 <-ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +geom_tile() +geom_text(aes(label =round(value, 2)), color ="black", size =2.5, vjust =0.5 , hjust =0.5) +# Center text within tilesscale_fill_gradient2(low ="blue", mid ="white", high ="red", midpoint =0) +labs(title ="Correlation Heatmap Ordered by Obesity Level", x ="Variables", y="Variables") +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1), axis.text.y =element_text(angle =45, vjust =1) )plot4 <-ggplotly(g4)plot4
The correlation matrices provide valuable insights into the relationships between variables and their association with obesity levels. As expected, weight exhibits a very strong positive correlation with obesity level, reinforcing its central role in defining the target variable. Family history of obesity and caloric food consumption also show moderate positive correlations, highlighting their relevance as predictive factors.
Conversely, variables such as physical activity and food consumption between meals exhibit weak or negative correlations, suggesting that their influence on obesity levels is less pronounced. These patterns align with logical trends but also underscore the need for careful consideration of multicollinearity and the relative importance of variables in predictive modeling. The heatmap’s clear organization of variables by their correlation strength aids in identifying the most impactful factors for further analysis. Overall, the results confirm that the dataset’s structure supports a robust examination of the factors influencing obesity.
3 Exploratory Data Analysis (EDA)
The Exploratory Data Analysis (EDA) phase of the project was designed to uncover meaningful patterns and insights while ensuring the dataset was optimized for analysis. A correlation heatmap was employed early in the process to identify and assess relationships between variables. By comparing the initial and refined versions of the heatmap, we effectively filtered out less relevant variables, allowing the analysis to focus on the most impactful features. This step not only streamlined the dataset but also enhanced its interpretability, ensuring a more targeted exploration of key patterns.
Key variables retained after this step were selected based on their strong correlations with the target variable and their potential relevance to underlying patterns in the data.
The EDA process involved a systematic exploration of the cleaned and refined dataset, utilizing visualization tools to highlight trends, distributions, and potential anomalies.
3.0.0.1 Descriptive statistics and distribution analysis
The initial phase of the Exploratory Data Analysis (EDA) concentrated on examining the most strongly correlated variables: Age, Height, and Weight. These variables were prioritized due to their direct relevance and significant relationships with the target outcomes, as highlighted in the refined correlation heatmap.
The age variable exhibits a right-skewed distribution, with a mean of 24.3 years and a median of 22.78 years, indicating a slight asymmetry toward younger ages. The range spans from 14 to 61 years, though the majority of individuals fall within the 20–30 age group. A standard deviation of 6.35 years reflects moderate variability in age across the dataset. This predominantly young sample may introduce limitations when generalizing findings to older populations, where obesity-related factors might differ significantly.
Age Distribution by Obesity Level
Code
g5 <-ggplot(dataset, aes(x = obesity_lev, y = age, fill = obesity_lev)) +geom_violin(trim =FALSE, alpha =0.6) +geom_boxplot(width =0.1, color ="black", fill ="white") +labs(title ="Age Distribution by Obesity Level", x ="Obesity Level", y ="Age") +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1))plot5 <-ggplotly(g5)plot5
The violin plot highlights the age distribution across obesity levels, illustrating that individuals with insufficient or normal weight are predominantly younger, with ages concentrated between 14 and 30 years. In contrast, higher obesity levels, such as Obesity Type I and Type II, display a broader age range, peaking around 30–40 years. Severe obesity (Type III) is rare among younger individuals but becomes more prevalent in mid-adulthood. This visualization emphasizes the gradual increase in obesity risk with age, underlining the importance of early intervention, particularly during early and mid-adulthood, when such risks are most pronounced.
Age Distribution with SMOOTH Trend Line for Obesity Probability
Code
g6 <-ggplot(dataset, aes(x = age, y =as.numeric(obesity_lev))) +geom_jitter(alpha =0.3) +geom_smooth(method ="loess", se =FALSE, color ="blue") +labs(title ="Trend of Obesity Level with Age", x ="Age", y ="Obesity Level") +theme_minimal()plot6 <-ggplotly(g6)plot6
Complementing this, the trend line graph further captures the trajectory of obesity levels with age. A sharp rise in obesity is observed from adolescence to early adulthood, peaking in the 25–30 years range. This critical transition phase is likely influenced by lifestyle factors such as reduced physical activity, increased caloric intake, and metabolic changes. After this peak, the trend reveals a gradual decline in obesity levels beyond 30 years, potentially reflecting improved health awareness, dietary adjustments, or a selection bias in older populations. These insights underscore the mid-20s to early-30s as a pivotal stage for targeted obesity prevention and intervention strategies.
g7 <-ggplot(dataset, aes(x = height)) +geom_histogram(bins =20, fill ="purple", color ="black", alpha =0.7) +labs(title ="Height Distribution", x ="Height (m)", y ="Count") +theme_minimal()plot7 <-ggplotly(g7)plot7
The height distribution, as shown in the histogram, follows an approximately normal shape with a slight right skew. Most values range between 1.45m and 1.98m, peaking around 1.8m, which represents the most common height. Both the mean and median are 1.7m, confirming a nearly symmetrical distribution. The standard deviation of 0.09 indicates low variability, and no extreme outliers are observed, highlighting a realistic and consistent dataset for height.
Box Plot of Height by Obesity Level
Code
g8 <-ggplot(dataset, aes(x = obesity_lev, y = height, fill = obesity_lev)) +geom_violin(alpha =0.6) +labs(title ="Height Distribution by Obesity Level", x ="Obesity Level", y ="Height") +theme_minimal() +theme(legend.position ="none", axis.text.x =element_text(angle =45, hjust =1))plot8 <-ggplotly(g8)plot8
The violin plot further explores the height distribution across obesity levels. Each category exhibits relatively low variability, with overlapping ranges across groups. Insufficient and Normal Weight categories have slightly narrower distributions, centered around 1.7m. As obesity levels increase, from Obesity Type I to Type III, the distributions remain consistent, indicating that height does not significantly vary with obesity classification. These findings suggest that while height remains a stable feature, weight likely plays a more decisive role in determining obesity levels.
g9 <-ggplot(dataset, aes(x = weight, fill = gender)) +geom_density(alpha =0.5) +labs(title ="Density Plot of Weight by Gender", x ="Weight", y ="Density") +scale_fill_manual(values =c("pink", "lightblue"), name ="Gender", labels =c("Female", "Male")) +theme_minimal()plot9 <-ggplotly(g9)plot9
The density plot highlights distinct differences in weight distribution between genders. Females generally exhibit lower weights, with a peak around 70 units, whereas males show peaks at 85 and 115 units, reflecting a tendency toward higher weights. An overlapping region between 80 and 90 units indicates common weight ranges for both genders, though the distinct peaks underscore gender-based differences. Weight in the dataset ranges from 39 to 173 units, with an average of 86.6 units, a median of 83 units, and a standard deviation of 26.2, indicating moderate variability.
Ridgeline Plot of Weight by Obesity Level.
Code
ggplot(dataset, aes(x = weight, y = obesity_lev, fill = obesity_lev)) +geom_density_ridges(scale =0.9, alpha =0.6) +labs(title ="Ridgeline Plot of Weight by Obesity Level", x ="Weight", y ="Obesity Level") +theme_minimal() +theme(legend.position ="none")
Code
# can't seem to make the interractive plot work
The ridgeline plot further illustrates the relationship between weight and obesity levels. As obesity levels increase, the weight distribution shifts consistently toward higher values. Categories such as “Insufficient Weight” and “Normal Weight” cluster at lower ranges, while higher obesity types (I, II, and III) peak at significantly greater weights. This clear progression confirms a strong positive association between weight and obesity levels, reinforcing the centrality of weight in obesity classification. The dataset’s average weight remains at 86.6 units with a standard deviation of 26.6, capturing the variability across different obesity categories.
3.0.0.1.4 Height and Weight
Researching deeply the relationship between height and weight, trends across obesity levels were examined using scatter plots, providing critical insights that reinforce theoretical expectations while contextualizing weight variations within height ranges for different obesity classifications.
Scatter Plot (height vs weight), colored by obesity level
Code
g11 <-ggplot(dataset, aes(x = height, y = weight, color = obesity_lev)) +geom_point(alpha =0.7) +geom_smooth(method ="lm", se =FALSE, aes(group = obesity_lev)) +# Adds a trend line for each obesity levelggtitle("Scatter Plot of Weight vs Height by Obesity Level") +theme_minimal() +labs(x ="Height", y ="Weight", color ="Obesity Level")plot11 <-ggplotly(g11)plot11
Facet Grid for Height and Weight by Obesity Level
Code
g12 <-ggplot(dataset, aes(x = height, y = weight)) +geom_point(alpha =0.7, aes(color = obesity_lev)) +facet_wrap(~ obesity_lev) +ggtitle("Facet Grid of Weight and Height by Obesity Level") +theme_minimal() +labs(x ="Height", y ="Weight", color ="Obesity Level") +theme(legend.position ="none")plot12 <-ggplotly(g12)plot12
The first scatter plot presents an overview, illustrating the general trend of increasing weight with height, stratified by obesity levels. To better isolate and visualize these individual trends, the initial graph is expanded into a facet grid, offering a clearer perspective on the separate trends within each obesity category and highlighting distinct relationships and ranges.
Correlation between height and weight
Code
correlation_height_weight <-cor(dataset$height, dataset$weight, use ="complete.obs")correlation_height_weight
[1] 0.457468
The observed correlation between height and weight (r = 0.463) aligns with findings in existing literature, confirming a moderate positive relationship and reinforcing the expectation that taller individuals generally weigh more, though the strength of this association varies slightly across obesity levels.
With the analysis of age, height, and weight completed, attention shifts to exploring the remaining variables in the dataset. These variables, while less directly correlated with the target outcomes, offer critical insights into behavioral, lifestyle, and environmental factors that may influence obesity levels.
3.0.0.1.5 Food between meals
Code
# Dodged Bar Chart for food_btw_meals by obesity levelsg13 <-ggplot(dataset, aes(x = food_btw_meals, fill = obesity_lev)) +geom_bar(position ="dodge", color ="black") +ggtitle("Dodged Bar Chart for Food Between Meals by Obesity Levels") +labs(x ="Food Between Meals", y ="Count", fill ="Obesity Levels") +theme_minimal() +theme(plot.title =element_text(hjust =0.5, size =14))plot13 <-ggplotly(g13)plot13
Code
# Stacked Bar Chart of Food Between Meals by Obesity Level (Proportions within each Obesity Level)g14 <-ggplot(dataset, aes(x = obesity_lev, fill = food_btw_meals)) +geom_bar(position ="fill") +# Stacked bar chart with proportionsscale_y_continuous(labels = scales::percent_format(accuracy =1)) +# Format y-axis as percentagesggtitle("Proportion of Food Between Meals Across Obesity Levels") +# Shortened and clear titlelabs(x ="Obesity Levels", y ="Proportion (%)", fill ="Food Between Meals") +# Correct axis and legend labelstheme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1), # Rotate x-axis text for readabilityplot.title =element_text(hjust =0.5, size =14) # Center and style the title )plot14 <-ggplotly(g14)plot14
The charts provide a clear illustration of how the frequency of eating between meals varies across obesity levels. The most dominant behavior across all categories is “Sometimes,” which peaks in intermediate levels like Normal Weight and Overweight Level I, reflecting a common pattern of moderate snacking. However, as obesity levels increase to Obesity Types I–III, the responses for “Frequently” and “Always” diminish, while “Sometimes” becomes even more prevalent. This shift could indicate that higher obesity levels are more associated with habitual moderate snacking rather than excessive meal-snacking frequency. On the other hand, “No” responses remain negligible across all obesity levels, suggesting that eating between meals is almost universal in this population. This pattern underscores the importance of examining not just the frequency but also the quality and context of snacking as potential contributors to obesity progression.
3.0.0.1.6 High-caloric food consumption
Code
# Grouped Bar Chart of High-Caloric Food by Obesity Level (Counts)g16 <-ggplot(dataset, aes(x = obesity_lev, fill = caloric_food)) +geom_bar(position ="dodge",color ="black" ) +ggtitle("Grouped Bar Chart of High-Caloric Food Consumption Across Obesity Levels") +labs(x ="Obesity Levels", y ="Count", fill ="High-Caloric Food Consumption") +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1),plot.title =element_text(hjust =0.5, size =14) )plot16 <-ggplotly(g16)plot16
The grouped bar chart highlights a clear trend of increased high-caloric food consumption as obesity levels rise. High-caloric food consumption (“yes”) dominates across all categories, surpassing 75% of responses and becoming nearly universal in higher obesity levels (Obesity Type I–III). In contrast, “no” responses are more prominent in lower categories like Insufficient Weight and Normal Weight but remain relatively infrequent.
More precisely, a notable 88.4% of participants report frequent consumption of high-calorie foods, a behavior strongly associated with weight gain. This underscores the critical importance of dietary interventions aimed at reducing high-calorie intake to address obesity progression effectively.
3.0.0.1.7 Alcohol consumption
Frequence in consumption of alcohol
Code
# Filter out "Always" responses from the datasetfiltered_dataset <- dataset %>%filter(freq_alcohol !="Always")# Dodged Bar Chart for freq_alcohol by Obesity Levels (excluding "Always")g17 <-ggplot(filtered_dataset, aes(x = freq_alcohol, fill = obesity_lev)) +geom_bar(position ="dodge", color ="black") +ggtitle("Dodged Bar Chart for Alcohol Consumption by Obesity Levels") +labs(x ="Alcohol Consumption Frequency", y ="Count", fill ="Obesity Levels") +theme_minimal() +theme(plot.title =element_text(hjust =0.5, size =14)) # Center and style the titleplot17 <-ggplotly(g17)plot17
Regarding alcohol consumption, the chart shows that “Sometimes” is the dominant alcohol consumption frequency across all obesity levels, particularly in Normal Weight, Overweight Level I, and II categories. As obesity increases, “Frequently” becomes slightly more prominent, especially in Obesity Type III, while “No” responses decrease, being more common in lower obesity levels such as Insufficient and Normal Weight. The “Always” responses are excluded from this chart due to their near absence in the dataset, highlighting that excessive alcohol consumption is rare. This trend underlines the potential relationship between moderate-to-frequent alcohol consumption and higher obesity levels, emphasizing its importance for obesity-related behavioral research.
Code
# Prepare the data summary for 'Sometimes' and 'No' responsesdata_summary <- dataset %>%filter(freq_alcohol %in%c("Sometimes", "No")) %>%group_by(obesity_lev, freq_alcohol) %>%summarise(count =n(), .groups ="drop") %>%group_by(obesity_lev) %>%mutate(total =sum(count),proportion = count / total ) %>%ungroup()# Visualization with updated titleg18 <-ggplot(data_summary, aes(x = obesity_lev, y = proportion, group = freq_alcohol, color = freq_alcohol)) +geom_line(linewidth =1.2) +geom_point(size =3) +scale_y_continuous(labels = scales::percent_format(accuracy =1)) +# Format y-axis as percentagesggtitle("Proportion of 'Sometimes' and 'No' Alcohol Responses by Obesity Level") +labs(x ="Obesity Level", y ="Proportion (%)", color ="Alcohol Frequency") +scale_color_manual(values =c("No"="purple", "Sometimes"="gold")) +# Improved color schemetheme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1),plot.title =element_text(hjust =0.5, size =14), # Center and style titlelegend.position ="top" )plot18 <-ggplotly(g18)plot18
To better illustrate the trends in alcohol consumption frequency across obesity levels, this graph was created to highlight the shifting proportions of individuals consuming alcohol “Sometimes” and abstaining (“No”). The proportion of individuals who drink alcohol “Sometimes” shows a steady increase with higher obesity levels, peaking in Obesity_Type_III. Conversely, the proportion of those who abstain from alcohol decreases as obesity levels rise, suggesting an inverse relationship between abstention and obesity severity.
This pattern raises questions about the potential interaction between alcohol consumption frequency and caloric food preferences, as both behaviors appear to be associated with higher obesity levels. Investigating this interaction could provide insights into whether a combination of moderate alcohol consumption and high-calorie food preferences exerts a compounded effect on obesity risk. Understanding these combined lifestyle factors could inform strategies aimed at mitigating obesity progression more effectively.
3.0.0.1.8 Daily Calorie Monitoring
Code
# Dodged Bar Chart for calorie_check by Obesity Levelsg19 <-ggplot(dataset, aes(x = calorie_check, fill = obesity_lev)) +geom_bar(position ="dodge", color ="black") +ggtitle(" Dodged Bar Chart for the check of the calories by Obesity Levels") +labs(x ="High-Caloric Food Consumption", y ="Count", fill ="Obesity Levels") +theme_minimal() +theme(plot.title =element_text(hjust =0.5, size =14)) # Center and style the titleplot19 <-ggplotly(g19)plot19
Code
data_summary <- dataset %>%group_by(obesity_lev, calorie_check) %>%summarise(count =n(), .groups ="drop") %>%mutate(total =sum(count), proportion = count / total)# Proportion of Calorie Checking by Obesity Levelg20 <-ggplot(data_summary, aes(x = obesity_lev, y = proportion, group = calorie_check, color = calorie_check)) +geom_line(linewidth =1.2) +geom_point(size =3) +scale_y_continuous(labels = scales::percent) +scale_color_manual(values =c("no"="lightcoral", "yes"="lightblue")) +labs(title ="Proportion of Calorie Checking by Obesity Level", x ="Obesity Level", y ="Proportion", color ="Calorie Check") +theme_minimal() +theme(legend.position ="none", axis.text.x =element_text(angle =45, hjust =1))plot20 <-ggplotly(g20)plot20
The Dodged Bar Chart highlights two main trends regarding calorie-checking behavior across obesity levels: a significant increase in “Yes” responses as obesity levels rise, particularly from Overweight Level II onward, and a decrease in “No” responses, which are more prevalent in lower obesity levels like Normal Weight and Insufficient Weight. The proportion graph simplifies these trends by clearly illustrating the proportional shift between “Yes” and “No” responses, making the contrast between lower and higher obesity levels more visually apparent. Together, these visualizations emphasize a potential association between obesity severity and an increased tendency to check calorie intake, suggesting heightened dietary awareness in higher obesity categories.
3.0.0.1.9 Vegetable consumption
Code
g21 <-ggplot(dataset, aes(x = vegetable_food)) +geom_histogram(aes(y =after_stat(density)), bins =30, fill ="lightgreen", color ="black", alpha =0.6) +geom_density(color ="darkgreen", linewidth =1) +ggtitle("Histogram and Density of Vegetable Food Consumption") +theme_minimal() +labs(x ="Vegetable Food Consumption", y ="Density")plot21 <-ggplotly(g21)plot21
Code
g22 <-ggplot(dataset, aes(x = weight, y = vegetable_food, color = obesity_lev)) +geom_point(alpha =0.6) +geom_smooth(method ="loess", se =FALSE, color ="black") +labs(title ="Scatterplot of Weight vs Vegetable Food Consumption", x ="Weight", y ="Vegetable Food Consumption") +theme_minimal() +coord_cartesian(xlim=c(40, 135), ylim=c(2, 3))plot22 <-ggplotly(g22)plot22
The scatterplot provided with the trend line illustrates a distinct, non-linear relationship: vegetable consumption initially decreases as weight increases but then begins to rise again at higher weight levels.
This pattern suggests that individuals with lower weight, particularly those in the Insufficient Weight and Normal Weight categories, tend to report higher vegetable consumption. As weight progresses toward the Overweight categories, vegetable consumption decreases slightly, indicating a possible reduction in healthy dietary habits. However, at the upper end of the weight spectrum, corresponding to Obesity Type II and Obesity Type III, vegetable consumption increases again, potentially due to dietary interventions or awareness in this group.
The trend reveals two possible key insights:
A dip in vegetable consumption occurs in intermediate weight ranges, aligning with the overweight population.
The sharp increase in vegetable consumption among the most obese individuals may reflect lifestyle adjustments prompted by health concerns or medical advice.
3.0.0.1.10 Physical activity
Code
g23 <-ggplot(dataset, aes(x = physical_act)) +geom_histogram(aes(y = ..density..), bins =30, fill ="skyblue", color ="black", alpha =0.6) +geom_density(color ="darkblue", linewidth =1) +ggtitle("Histogram and Density of Physical Activity") +theme_minimal() +labs(x ="Physical Activity", y ="Density")plot23 <-ggplotly(g23)plot23
The histogram and density plot reveal that physical activity levels have distinct peaks at 0, 1, 2, and 3, suggesting that these values are common reported levels. Intermediate values, likely due to synthetic data or SMOTE, are also present but less frequent.
Violin plot by category
Code
g24 <-ggplot(dataset, aes(x = obesity_lev, y = physical_act, fill = obesity_lev)) +geom_violin(trim =FALSE, alpha =0.6) +geom_boxplot(width =0.1, color ="black", fill ="white") +ggtitle("Violin Plot of Physical Activity by Obesity Level") +labs(x ="Obesity Level", y ="Physical Activity") +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1))plot24 <-ggplotly(g24)plot24
Physical activity levels show a slight decline as obesity levels increase, particularly evident in the narrowing distributions and lower medians observed for Obesity Type II and Obesity Type III categories. In contrast, the Insufficient Weight and Normal Weight groups exhibit higher physical activity levels, as reflected by their broader and more symmetrical distributions.
The graph reveals a distinct trend: individuals in lower obesity categories engage in more physical activity compared to those in higher obesity categories. This trend suggests an inverse relationship between physical activity and obesity levels.
3.0.0.1.11 Water consumption
Code
g25 <-ggplot(dataset, aes(x = ch2o)) +geom_histogram(aes(y = ..density..), bins =30, fill ="skyblue", color ="black", alpha =0.6) +geom_density(color ="darkblue", size =1) +ggtitle("Histogram and Density of Comsumption of Water") +theme_minimal() +labs(x ="CH2O", y ="Density")plot25 <-ggplotly(g25)plot25
This histogram and density plot of daily water consumption (CH2O) shows a clear peak at 2 liters, indicating that most individuals consume around this amount. This aligns with scientific literature, which generally recommends an average daily water intake of about 2 liters for optimal health.
Trend Line of Weight vs Water Consumption
Code
# Scatterplot with a LOESS trend lineg26 <-ggplot(dataset, aes(x = weight, y = ch2o, color = obesity_lev)) +geom_point(alpha =0.6) +geom_smooth(method ="loess", se =FALSE, color ="black") +labs(title ="Scatterplot of Weight vs Water Consumption", x ="Weight", y ="Water Consumption (ch2o)") +theme_minimal() +coord_cartesian(xlim=c(35, 135))plot26 <-ggplotly(g26)plot26
The scatterplot visualizes the relationship between weight and water consumption (ch2o), categorized by obesity levels. The trend line reveals a slightly increasing pattern of water consumption as weight increases, though the relationship is relatively weak and mostly linear.
This pattern suggests that individuals with Insufficient Weight and Normal Weight categories generally report slightly lower water consumption compared to individuals in the higher weight categories, such as Obesity Type II and III. The increase in water consumption among higher weight groups could indicate attempts to adopt healthier habits or increased hydration needs due to larger body sizes. However, the relatively flat trend across most weight ranges suggests that water consumption does not vary dramatically across different weight categories, highlighting a potential area for targeted interventions to promote hydration as a component of healthy dietary behavior.
3.0.0.1.12 Technology utilization
Density of Use of Technology by Obesity Level
Code
g28 <-ggplot(dataset, aes(x = use_tech, fill = obesity_lev)) +geom_density(alpha =0.5) +labs(title ="Density of Use of Technology by Obesity Level", x ="Use of Technology", y ="Density") +theme_minimal()plot28 <-ggplotly(g28)plot28
This density plot provides a perspective on the use of technology across different obesity levels. A striking feature is the sharp, dominant peak in Obesity Type III (yellow) around the value of 1. This pattern diverges notably from the smoother and more evenly distributed curves seen in other obesity categories, suggesting a unique behavioral trend in this group.
The peak indicates a strong clustering of individuals in Obesity Type III who report moderate use of technology, which may reflect consistent engagement with technology-based activities such as sedentary work, entertainment, or even health-monitoring applications. In contrast, other obesity categories, such as Obesity Type II and Overweight Level II, exhibit more balanced distributions without a single dominant peak, hinting at more varied technology usage patterns.
This observation raises interesting questions about the role of technology in shaping lifestyle behaviors in Obesity Type III individuals. It may point to a reliance on technology that correlates with a sedentary lifestyle, a known risk factor for obesity. Alternatively, it could reflect targeted interventions or habits specific to this group.
The Exploratory Data Analysis (EDA) phase provided a comprehensive understanding of the dataset, offering key insights into the relationships between various behavioral, lifestyle, and demographic factors and obesity levels. By focusing on critical variables the EDA revealed patterns and trends that are integral to the modeling process.
4 Analysis
The analysis phase is dedicated to the development, refinement, and comprehensive evaluation of the predictive models, meticulously designed to directly address the previously defined research questions.
4.1 Methods
The modeling process is structured to address the two key research questions:
identifying the most significant lifestyle and behavioral factors contributing to obesity in Mexico, Peru, and Colombia;
predicting whether a person will be obese based on some given combinations of factors.
4.1.1 Logistic Regression Model
To accurately address the key research questions, a logistic regression model will be employed to estimate the probability of individuals belonging to a categorie: obese or not obese. Weight and height will be excluded as predictors in the model because they are directly used to calculate BMI, which serves as the basis for the obesity levels categorized in the dataset. Including these variables would create a dependency between the predictors and the target variable, potentially biasing the analysis. By excluding weight and height, the focus shifts to behavioral and lifestyle factors, such as dietary habits, physical activity, and demographic characteristics, to better understand their influence on obesity risk.
While logistic regression provides a clear and interpretable framework for estimating probabilities, it inherently limits the analysis to a binary classification. This restriction prevents the exploration of the full spectrum of obesity levels, such as Obesity Type I, II, or III, as classified in the dataset. Despite this limitation, logistic regression is a robust method for quantifying the relationships between independent variables and the binary outcome. Feature selection techniques will ensure that only the most relevant predictors are retained, and the model’s performance will be rigorously evaluated using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC, ensuring reliable and actionable insights.
4.1.2 Insights and Limitations
Regression analysis helps us understand how predictors influence outcomes, with logistic regression classifying individuals as obese or not obese. As already discussed in the previous sections, the dataset offers a mix of advantages and challenges: synthetic data ensures balanced representation but lacks the complexity of real-world patterns, while user-collected data adds variability but is prone to biases. Logistic regression simplifies the analysis by focusing on binary outcomes, leaving out the nuanced gradations of obesity, and assumes linearity, which may not fully capture complex relationships. Despite these limitations, the model offers insights into obesity risk, serving as a valuable exercise and foundation for future data explorations, even if not directly applicable to real-world scenarios.
4.2 Objectives of the Selected Method
4.2.1Logistic Regression Model Development
Data Loading and Processing
To align with the requirements of a logistic regression model, it was necessary to modify the dataset’s target variable. The original variable, obesity level, was a multi-class categorical variable representing varying degrees of obesity and non-obesity. Since logistic regression is designed for binary classification, the target variable was converted into a binary format. Individuals with a BMI ≥ 30 were classified as obese (1), while others were classified as non-obese (0). This transformation ensured compatibility with the logistic regression framework. Following this adjustment, the target variable was converted into a factor, and the dataset was reviewed for consistency and readiness for analysis.
Twelve predictors associated with obesity-related behaviors, dietary habits, physical activity, and lifestyle factors were selected for analysis. These variables, along with the binary target variable Obesity (1 = obese, 0 = not obese), formed the dataset for logistic regression modeling. The dataset was reviewed to ensure correct structure and readiness for analysis.
Three regression models were employed to ensure a systematic and robust approach to predictor selection and model development. The null model, containing only the intercept, served as a baseline to represent predictions without the influence of any predictors. This provided a reference point to evaluate how much additional explanatory power was gained by including predictors.
The full model, incorporating all predictors, represented the maximum complexity allowable within the dataset. This model helped understand the potential contribution of each variable but carried the risk of overfitting due to its complexity.
The stepwise model, guided by the Akaike Information Criterion (AIC), balanced the simplicity and performance of the model. By iteratively evaluating the inclusion or exclusion of predictors, the stepwise procedure identified the subset of variables that significantly contributed to explaining the outcome while minimizing unnecessary complexity. This process ensured that the final model retained only the most relevant predictors, achieving optimal fit and generalizability. Using these three models allowed for a thorough comparison and the development of a parsimonious and effective predictive model.
Code
full_model <-glm(Obesity ~ ., data = model_data, family = binomial)null_model <-glm(Obesity ~1, data = model_data, family = binomial)stepwise_model <-step(null_model, scope =list(lower = null_model, upper = full_model), direction ="both", trace =FALSE)
Presented below is a comprehensive overview of the logistic regression models.
Coefficients of the Stepwise Logistic Regression Model
Predictor
Estimate
Std. Error
z value
Pr(>|z|)
(Intercept)
-8.7956227
0.7141792
-12.3157087
0.0000000
family_histyes
3.6796264
0.3746724
9.8209158
0.0000000
food_btw_mealsFrequently
-2.0244197
0.5873593
-3.4466462
0.0005676
food_btw_mealsNo
-0.6183034
0.8999773
-0.6870211
0.4920694
food_btw_mealsSometimes
1.3604530
0.4559138
2.9840138
0.0028449
caloric_foodyes
2.1168040
0.2566262
8.2485893
0.0000000
vegetable_food
0.8886854
0.1115908
7.9637892
0.0000000
m_transBike
0.3743789
1.4395954
0.2600584
0.7948187
m_transMotorbike
1.6429866
0.9411341
1.7457517
0.0808541
m_transPublic_Transportation
0.6015745
0.1282129
4.6919960
0.0000027
m_transWalking
-1.8531534
0.6503271
-2.8495711
0.0043778
calorie_checkyes
-2.6730603
0.6307480
-4.2379214
0.0000226
physical_act
-0.3462794
0.0705278
-4.9098293
0.0000009
use_tech
-0.4174300
0.0956562
-4.3638577
0.0000128
smokeyes
1.0211075
0.4549016
2.2446777
0.0247888
ch2o
0.1679017
0.0994235
1.6887520
0.0912670
Evaluation
To evaluate the stepwise-selected model, predicted probabilities of obesity were generated for all individuals. These probabilities were converted into binary classifications using a threshold of 0.5. A confusion matrix was constructed to assess the model’s performance, providing key metrics such as accuracy, sensitivity, specificity, precision, and F1-score.
# Compute ROC curve using the actual class labels and predicted probabilitiesroc_curve <- pROC::roc(model_data$Obesity, predicted_probs)auc_value <- pROC::auc(roc_curve)print(auc_value)
Area under the curve: 0.8556
Additionally, the ROC curve and AUC were calculated to further evaluate the model’s discriminative ability. The ROC curve visualizes the trade-off between sensitivity and specificity, while the AUC quantifies the model’s ability to distinguish between obese and non-obese individuals.
Results Visualization
To assess the distribution of predicted probabilities, a scatter plot was created with observations color-coded by their actual class. This visualization provides a clear overview of the model’s predictions and potential misclassifications.
Code
plot(predicted_probs, col =ifelse(model_data$Obesity ==1, "blue", "red"), pch =16, xlab ="n° Observation", ylab ="Predicted Probability", main ="Predicted Probabilities of Obesity", cex =0.6)legend("bottomright", legend =c("Obese", "Not Obese"), col =c("blue", "red"), pch =16)
For additional clarity, the ROC curve was plotted to visually represent the model’s performance.
To test the model’s ability to predict the probability of individuals becoming obese, six distinct profiles were created, representing a diverse range of lifestyles. Each profile was carefully designed to highlight specific behavioral, dietary, and lifestyle patterns.
The first individual represents a high-risk case for obesity. This person has a family history of being overweight, frequently consumes high-calorie foods and snacks, and eats very few vegetables. They have five meals a day, drink only 0.5 liters of water daily, and do no physical activity. Additionally, they spend 10 hours a day using technology, consume alcohol consistently, and rely primarily on public transportation for mobility.
The second individual exemplifies a very healthy lifestyle. They have no family history of being overweight, rarely consume high-calorie foods or snacks, and eat a large amount of vegetables. Their diet consists of very few meals per day, complemented by a high water intake of 4 liters daily. They do not monitor calorie intake but engage in physical activity five times a week. They walk as their primary mode of transportation, do not consume alcohol, and spend only 0.5 hours daily using technology.
The third individual exhibits a balanced lifestyle but shows some risk factors. This person has a family history of being overweight, frequently consumes snacks and high-calorie foods, and eats a moderate amount of vegetables. They have three meals a day, drink 1 liter of water, and monitor their calorie intake. However, they engage in physical activity only once a week, use technology for 8 hours daily, use motorbike as transportation vehicle, and occasionally consume alcohol.
The fourth individual is physically active and health-conscious. They have no family history of being overweight, do not frequently consume high-calorie foods, but snack occasionally. They eat a lot of vegetables, have a small number of meals per day, and drink 3 liters of water daily. They do not monitor calorie intake but exercise three times a week and use a bicycle for transportation. They consume alcohol frequently but spend only an hour daily using technology.
The fifth individual represents another high-risk case due to a sedentary lifestyle. They have a family history of being overweight, frequently consume high-calorie foods and snacks, and eat very few vegetables. They have four meals a day, drink 2 liters of water, and do no physical activity. They spend 6 hours daily using technology, consume alcohol moderately, and rely on public transportation.
The sixth individual leads a very active lifestyle but has some risk factors due to alcohol and transportation choices. They have no family history of being overweight, do not frequently consume high-calorie foods or snacks, and eat a large amount of vegetables. They have two meals per day, drink 1.5 liters of water, and do not monitor calorie intake. However, they engage in physical activity four times a week, use a motorbike for transportation, do not consume alcohol, and spend 2 hours daily using technology.
These six profiles were designed to test the model’s capacity to handle a wide variety of real-world scenarios, ensuring it can effectively predict obesity probabilities across diverse populations.
So far, we have conducted a comprehensive exploration and preparation of our dataset, focusing on understanding the influence of lifestyle factors on obesity within a sample from Mexico, Peru, and Colombia. The dataset, which was pre-processed with SMOTE to address class imbalance, has provided us with balanced obesity categories, facilitating an in-depth analysis of key variables such as eating habits, physical activity, and alcohol consumption. Through correlation analysis, we identified the variables with the strongest associations to obesity levels, helping to guide our selection of factors for inclusion in the next modeling phase. Additionally, we have thoroughly cleaned and structured the data, renaming variables for clarity, formatting categorical variables, and removing duplicates to ensure a solid foundation for robust modeling.
The next steps involve constructing regression models to analyze the relationships and predictive power of these selected factors on obesity levels. Specifically, we will develop two versions of the model—one that includes extreme values and one that excludes them—to evaluate the impact of outliers on model accuracy and stability. Key metrics such as R², P-values, and VIF will be used to confirm the reliability of the model and address potential multicollinearity issues. Following this, we will build and fine-tune a predictive model using metrics like Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and R² to validate and enhance performance.
These efforts will culminate in a final report that, while primarily an exercise and not applicable in real-world contexts, highlights our findings and offers insights into the most influential lifestyle factors affecting obesity. This analysis aims to provide actionable recommendations within a simulated scenario, illustrating how data-driven insights could support public health strategies focused on obesity reduction.
4.4 Next Steps
Outline the next steps planned for completing the project, such as refining analyses, adding new methods, or addressing outstanding data issues.
4.5 Final Thoughts
Briefly reflect on any challenges or limitations encountered so far and how these might be addressed in the final report.
Source Code
---title:Project Update Report (Group G): Code and Structureauthor: - Alessandro Pizzi - Andrea Lovato - Ayman El Abed - Illia Dorofieievinstitute: University of Lausannedate: todaytitle-block-banner: "#0095C8" # chosen for the university of lausannetoc: truetoc-location: rightformat: html: number-sections: true html-math-method: katex self-contained: true code-overflow: wrap code-fold: true code-tools: true include-in-header: # add custom css to make the text in the `</> Code` dropdown black text: | <style type="text/css"> .quarto-title-banner a { color: #000000; } </style> pdf: # use this if you want to render pdfs instead include-in-header: # wrapping the code also in the pdf (otherwise, it overflows) text: | \usepackage{fvextra} \DefineVerbatimEnvironment{Highlighting}{Verbatim}{ commandchars=\\\{\}, breaklines, breaknonspaceingroup, breakanywhere }abstract: | Obesity has become a global health crisis, contributing to rising rates of non-communicable diseases and placing significant strain on healthcare systems worldwide. In this study, we explore the factors influencing obesity in Mexico, Peru, and Colombia through an analysis of a mixed dataset consisting of 77% synthetically generated data and 23% real-world data from 498 participants. Using data cleaning, visualization, and advanced modeling techniques, we identify key lifestyle and behavioral contributors to obesity, including dietary habits, physical activity, and demographic factors. The study employs linear regression to predict BMI and logistic regression to classify obesity, providing valuable insights into the relationship between these factors and obesity levels. While the findings are limited by the use of synthetic data and a non-representative sample, they underscore the importance of data-driven approaches in addressing public health challenges. This report aims to demonstrate the application of theoretical knowledge in a simulated environment and lay the groundwork for future studies targeting obesity reduction strategies.---```{r, message=FALSE, warning=FALSE}#| label: setup#| echo: false#| message: false# loading all the necessary packagessource(here::here("src", "setup.R"))```{{< include sections/_introduction.qmd >}}{{< include sections/_data.qmd >}}{{< include sections/_eda.qmd >}}{{< include sections/_analysis.qmd >}}{{< include sections/_conclusion.qmd >}}